AITopics | Fortaleza

Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport

Eufrazio, Rafael Pereira, Montesuma, Eduardo Fernandes, Cavalcante, Charles Casimiro

arXiv.org Machine Learning, Apr-6-2026

Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Classical approaches often rely on feature concatenation or explicit alignment assumptions, which become restrictive under heterogeneous geometries or nonlinear distortions. In this work, we propose two geometry-aware multi-view embedding strategies grounded in Gromov-Wasserstein (GW) optimal transport. The first, termed Mean-GWMDS, aggregates view-specific relational information by averaging distance matrices and applying GW-based multidimensional scaling to obtain a representative embedding. The second strategy, referred to as Multi-GWMDS, adopts a selection-based paradigm in which multiple geometry-consistent candidate embeddings are generated via GW-based alignment and a representative embedding is selected. Experiments on synthetic manifolds and real-world datasets show that the proposed methods effectively preserve intrinsic relational structure across views. These results highlight GW-based approaches as a flexible and principled framework for multi-view representation learning.
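
The aggregation step described for Mean-GWMDS (average the per-view distance matrices, then embed the average) can be sketched as follows. This is only an illustration under stated assumptions: classical (Torgerson) MDS is swapped in for the GW-based MDS used in the paper, and the helper names and toy views are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS, used here as a stand-in for GW-based MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]     # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def mean_distance_embedding(views, dim=2):
    """Average the per-view distance matrices, then embed the average."""
    D_mean = np.mean([pairwise_distances(V) for V in views], axis=0)
    return classical_mds(D_mean, dim=dim)

rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))

# Two hypothetical views of the same samples: a 2-D rotation and a
# distance-preserving lift into 5 dimensions.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5x2 with orthonormal columns
views = [latent @ R.T, latent @ Q.T]

Y = mean_distance_embedding(views, dim=2)
```

Because both toy views are isometric to the latent points, the embedding `Y` reproduces the latent pairwise distances up to a rigid motion. Views with genuinely different geometries are the case the paper targets, and this sketch does not cover them.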

artificial intelligence, machine learning, representation, (19 more...)

arXiv.org Machine Learning

2604.0261

Country:

South America > Brazil > Ceará > Fortaleza (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre:

Research Report (0.50)
Overview (0.46)

Industry: Energy (0.30)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports

Caminha, Carlos, Silva, Maria de Lourdes M., Chaves, Iago C., Brito, Felipe T., Farias, Victor A. E., Machado, Javam C.

arXiv.org Artificial Intelligence, Jul-2-2025

Computer manufacturers offer platforms for users to describe device faults using textual reports such as "My screen is flickering". Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve an F1-score of up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct
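
The four prompting strategies named above differ only in which blocks are added to the prompt. A minimal sketch, assuming a hypothetical label set, hypothetical few-shot examples, and hypothetical wording (none of these come from the paper):

```python
# Hypothetical component labels and examples, for illustration only.
COMPONENTS = ["display", "battery", "keyboard", "audio"]
FEW_SHOT_EXAMPLES = [
    ("My laptop dies after ten minutes unplugged", "battery"),
    ("Some keys type the wrong letter", "keyboard"),
]
STRATEGIES = ("zero-shot", "few-shot", "cot", "cot+fs")

def build_prompt(report, strategy):
    """Assemble a prompt for one of the four strategies."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    parts = [
        "Identify the faulty hardware component from the user report.",
        "Answer with one of: " + ", ".join(COMPONENTS) + ".",
    ]
    # Few-shot variants prepend labeled examples.
    if strategy in ("few-shot", "cot+fs"):
        for text, label in FEW_SHOT_EXAMPLES:
            parts.append(f'Report: "{text}"\nComponent: {label}')
    # Chain-of-thought variants ask for intermediate reasoning.
    if strategy in ("cot", "cot+fs"):
        parts.append("Think step by step before giving the final answer.")
    parts.append(f'Report: "{report}"\nComponent:')
    return "\n\n".join(parts)

prompts = {s: build_prompt("My screen is flickering", s) for s in STRATEGIES}
```

Each entry of `prompts` would then be sent to a model and the predicted component compared against the ground-truth label to compute the F1-score.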

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.00742

Country: South America > Brazil > Ceará > Fortaleza (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Security & Privacy (0.68)
Information Technology > Hardware (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

DataMan: Data Manager for Pre-training Large Language Models

Peng, Ru, Yang, Kexin, Zeng, Yawen, Lin, Junyang, Liu, Dayiheng, Zhao, Junbo

arXiv.org Artificial Intelligence, Mar-13-2025

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking": prompting LLMs to self-identify which criteria benefit their performance. As their pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

conference paper, consistency, criteria, (15 more...)

arXiv.org Artificial Intelligence

2502.19363

Country:

South America > Venezuela (0.14)
Europe > Norway (0.13)
North America > United States > Alabama (0.04)
(49 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Sports (1.00)
Law (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)

Generative Active Adaptation for Drifting and Imbalanced Network Intrusion Detection

Gupta, Ragini, Liu, Shinan, Zhang, Ruixiao, Hu, Xinyue, Kommaraju, Pranav, Wang, Xiaoyang, Benkraouda, Hadjer, Feamster, Nick, Nahrstedt, Klara

arXiv.org Artificial Intelligence, Mar-4-2025

Machine learning has shown promise in network intrusion detection systems, yet its performance often degrades due to concept drift and imbalanced data. These challenges are compounded by the labor-intensive process of labeling network traffic, especially when dealing with evolving and rare attack types, which makes selecting the right data for adaptation difficult. To address these issues, we propose a generative active adaptation framework that minimizes labeling effort while enhancing model robustness. Our approach employs density-aware active sampling to identify the most informative samples for annotation and leverages deep generative models to synthesize diverse samples, thereby augmenting the training set and mitigating the effects of concept drift. We evaluate our end-to-end framework on both simulated IDS data and a real-world ISP dataset, demonstrating significant improvements in intrusion detection performance. Our method boosts the overall F1-score from 0.60 (without adaptation) to 0.86. Rare attacks such as Infiltration, Web Attack, and FTP-BruteForce, which originally achieve F1 scores of 0.001, 0.04, and 0.00, improve to 0.30, 0.50, and 0.71, respectively, with generative active adaptation in the CIC-IDS 2018 dataset. Our framework effectively enhances rare attack detection while reducing labeling costs, making it a scalable and adaptive solution for real-world intrusion detection.

augmentation, dataset, netguard, (15 more...)

arXiv.org Artificial Intelligence

2503.03022

Country:

North America > United States > Illinois > Champaign County > Urbana (0.14)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > Illinois > Cook County > Chicago (0.04)
(12 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Data denoising with self consistency, variance maximization, and the Kantorovich dominance

Hiew, Joshua Zoen-Git, Lim, Tongseok, Pass, Brendan, de Souza, Marcelo Cruz

arXiv.org Artificial Intelligence, Feb-5-2025

We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.

artificial intelligence, machine learning, var, (16 more...)

arXiv.org Artificial Intelligence

2502.02925

Country:

South America > Brazil > Ceará > Fortaleza (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Text-to-SQL based on Large Language Models and Database Keyword Search

Nascimento, Eduardo R., Avila, Caio Viktor S., Izquierdo, Yenier T., García, Grettel M., Andrade, Lucas Feijó L., Facina, Michelle S. P., Lemos, Melissa, Casanova, Marco A.

arXiv.org Artificial Intelligence, Jan-23-2025

Text-to-SQL prompt strategies based on Large Language Models (LLMs) achieve remarkable performance on well-known benchmarks. However, when applied to real-world databases, their performance is significantly lower than on these benchmarks, especially for Natural Language (NL) questions that require complex filters and joins. This paper proposes a strategy to compile NL questions into SQL queries that incorporates a dynamic few-shot examples strategy and leverages the services provided by a database keyword search (KwS) platform. The paper details how the precision and recall of the schema-linking process are improved with the help of the examples provided and the keyword-matching service that the KwS platform offers. Then, it shows how the KwS platform can be used to synthesize a view that captures the joins required to process an input NL question and thereby simplify the SQL query compilation step. The paper includes experiments with a real-world relational database to assess the performance of the proposed strategy. The experiments suggest that the strategy achieves an accuracy on the real-world relational database that surpasses state-of-the-art approaches. The paper concludes by discussing the results obtained.

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2501.13594

Country:

South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
North America > United States (0.04)
Europe > Switzerland (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.46)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Novel End-To-End Event Geolocation Method Leveraging Hyperbolic Space and Toponym Hierarchies

Qiao, Yaqiong, Huang, Guojun

arXiv.org Artificial Intelligence, Dec-14-2024

Timely detection and geolocation of events based on social data can provide critical information for applications such as crisis response and resource allocation. However, most existing methods are greatly affected by event detection errors, leading to insufficient geolocation accuracy. To this end, this paper proposes a novel end-to-end event geolocation method (GTOP) leveraging hyperbolic space and toponym hierarchies. Specifically, the proposed method contains one event detection module and one geolocation module. The event detection module constructs a heterogeneous information network based on social data, then constructs a homogeneous message graph and combines it with the text and time features of the messages to learn initial node features. Node features are updated in hyperbolic space and then fed into a classifier for event detection. To reduce the geolocation error, this paper proposes a noise toponym filtering algorithm (HIST) based on the hierarchical structure of toponyms. HIST analyzes the hierarchical structure of toponyms mentioned in the event cluster, taking the most frequent city-level locations as the coarse-grained locations for events. To further improve the geolocation accuracy, we propose a fine-grained pseudo-toponym generation algorithm (FIT) based on the output of HIST, and combine the generated pseudo-toponyms with the filtered toponyms to locate events based on the geographic center points of the combined toponyms. Extensive experiments are conducted on the Chinese dataset constructed in this paper and another public English dataset. The experimental results show that the proposed method is superior to the state-of-the-art baselines.

data mining, machine learning, toponym, (19 more...)

arXiv.org Artificial Intelligence

2412.1087

Country:

North America > United States > Texas > Dallas County > Dallas (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
South America > Brazil > Ceará > Fortaleza (0.04)
(15 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology (0.68)
Law (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

Integer linear programming for unsupervised training set selection in molecular machine learning

Haeberle, Matthieu, van Gerwen, Puck, Laplaza, Ruben, Briling, Ksenia R., Weinreich, Jan, Eisenbrand, Friedrich, Corminboeuf, Clemence

arXiv.org Artificial Intelligence, Oct-21-2024

Integer linear programming (ILP) is an elegant approach to solving linear optimization problems that are naturally described using integer decision variables. Within the context of physics-inspired machine learning applied to chemistry, we demonstrate the relevance of an ILP formulation to select molecular training sets for predictions of size-extensive properties. We show that our algorithm outperforms existing unsupervised training set selection approaches, especially when predicting properties of molecules larger than those present in the training set. We argue that the improved performance is due to a selection based on the notion of local (i.e., per-atom) similarity and to a unique ILP approach that finds optimal solutions efficiently. Altogether, this work provides a practical algorithm to improve the performance of physics-inspired machine learning models and offers insights into the conceptual differences with existing training set selection approaches.

artificial intelligence, machine learning, optimization problem, (15 more...)

arXiv.org Artificial Intelligence

2410.16122

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Switzerland > Vaud > Lausanne (0.05)
(4 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.51)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

An explainable approach to detect case law on housing and eviction issues within the HUDOC database

Mohammadi, Mohammad, Wieling, Martijn, Vols, Michel

arXiv.org Artificial Intelligence, Oct-3-2024

Case law is instrumental in shaping our understanding of human rights, including the right to adequate housing. The HUDOC database provides access to the textual content of case law from the European Court of Human Rights (ECtHR), along with some metadata. While this metadata includes valuable information, such as the application number and the articles addressed in a case, it often lacks detailed substantive insights, such as the specific issues a case covers. This underscores the need for detailed analysis to extract such information. However, given the size of the database, which contains over 40,000 cases, an automated solution is essential. In this study, we focus on the right to adequate housing and aim to build models to detect cases related to housing and eviction issues. Our experiments show that the resulting models not only provide performance comparable to more sophisticated approaches but are also interpretable, offering explanations for their decisions by highlighting the most influential words. The application of these models led to the identification of new cases that were initially overlooked during data collection. This suggests that NLP approaches can be effectively applied to categorise case law based on the specific issues they address.

arXiv.org Artificial Intelligence

2410.02978

Country:

Europe > Netherlands (0.14)
Europe > Jersey (0.14)
Europe > Latvia (0.04)
(4 more...)

Genre: Research Report (0.70)

Industry:

Law > International Law (0.66)
Law > Civil Rights & Constitutional Law (0.56)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
(2 more...)

Fair Railway Network Design

He, Zixu, Botan, Sirin, Lang, Jérôme, Saffidine, Abdallah, Sikora, Florian, Workman, Silas

arXiv.org Artificial Intelligence, Sep-3-2024

When designing a public transportation network in a country, one may want to minimise the sum of travel duration of all inhabitants. This corresponds to a purely utilitarian view and does not involve any fairness consideration, as the resulting network will typically benefit the capital city and/or large central cities while leaving some peripheral cities behind. On the other hand, a more egalitarian view will allow some people to travel between peripheral cities without having to go through a central city. We define a model, propose algorithms for computing solution networks, and report on experiments based on real data.
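
The tension between the utilitarian and egalitarian views can be illustrated on a toy instance. The sketch below is not the paper's model: the four cities, their populations, the candidate links, the population-product demand weighting, and the use of the worst pairwise travel time as the egalitarian score are all assumptions made for illustration.

```python
from itertools import combinations

def shortest_durations(n, edges):
    """All-pairs shortest travel durations (Floyd-Warshall)."""
    inf = float("inf")
    d = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def utilitarian_cost(d, pop):
    """Total travel duration, weighting each city pair by pop[i] * pop[j] (assumed demand model)."""
    return sum(pop[i] * pop[j] * d[i][j]
               for i, j in combinations(range(len(pop)), 2))

def egalitarian_cost(d):
    """Worst travel duration over all city pairs (assumed fairness score)."""
    return max(d[i][j] for i, j in combinations(range(len(d)), 2))

# City 0 is the capital; cities 1-3 are small. Edges are (u, v, duration).
pop = [10, 1, 1, 1]
hub_at_capital = [(0, 1, 2), (0, 2, 6), (0, 3, 6)]
hub_at_small_city = [(0, 1, 2), (1, 2, 5), (1, 3, 5)]

scores = {}
for name, edges in [("capital", hub_at_capital), ("small", hub_at_small_city)]:
    d = shortest_durations(len(pop), edges)
    scores[name] = (utilitarian_cost(d, pop), egalitarian_cost(d))
```

On this instance the capital-centered network has the lower utilitarian cost (168 versus 180), while the network hubbed at a small city has the lower worst-case travel time (10 versus 12), since cities 2 and 3 no longer route through the capital.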

algorithm, budget, gini index, (9 more...)

arXiv.org Artificial Intelligence

2409.02152

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)
(64 more...)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Rail (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Communications > Networks (0.83)
